This paper introduces a machine learning based collaborative multi-band spectrum sensing policy for cognitive radios.
The proposed sensing policy guides secondary users to focus the search for unused radio spectrum on those frequencies that
persistently provide them with a high data rate. The policy is based on machine learning, which makes it adaptive
to the temporally and spatially varying radio spectrum. Furthermore, there is no need for dynamic modeling of the
primary activity since it is implicitly learned over time. Energy efficiency is achieved by minimizing the number of
sensors assigned to each subband under a constraint on the miss detection probability. It is important to control missed
detections because they cause collisions with primary transmissions and lead to retransmissions at both the primary
and the secondary user. Simulations show that the proposed sensing policy improves the overall
throughput of the secondary network and its energy efficiency while controlling the miss detection probability.
1. Introduction

The increasing demand for wireless services has made the usable radio
spectrum a scarce and expensive resource. Part of the scarcity problem
stems from spectrum allocation policies that do not exploit the fact
that the state of the radio frequency spectrum varies with time and
location. Measurement campaigns pQ have in fact shown that large parts
of the spectrum are underutilized, either because the license holders
are not using the spectrum or because the fact that wireless signals
attenuate with the 2nd to 4th power of distance is not fully exploited.
Underutilized spectrum is a time-, frequency- and location-varying
resource, and radio wave propagation and signal attenuation are
important factors in determining where spectrum opportunities or areas
of harmful interference occur. Identifying temporal and spatial spectrum
holes has been the key motivation behind cognitive radio (CR) and
dynamic spectrum access (DSA). Figure 1 illustrates how spectrum holes
emerge in time and frequency.

CR systems try to use the licensed radio spectrum in an agile manner
while guaranteeing that the licensed users will not be interfered with
(see Figure 1). A spectrum opportunity is a situation in which secondary
users (SU) are able to communicate on a licensed frequency without
interfering with the primary user (PU) and without being themselves
interfered with by the PU [3]. In order to find such spectrum
opportunities, CR systems need to sense the spectrum (see Figure 2).

Figure 1: Spectrum holes in time and frequency in a given location.
A spectrum hole emerges when a primary user (activity indicated by the
blocks) vacates its frequency. Secondary users try to opportunistically
detect and access these spectrum holes (indicated by the green arrows).

A CR network can be considered to consist of $N_S$ spatially distributed
wireless terminals that identify free frequencies across a wide spectrum
of interest, which is assumed to be divided into $N_B$ subbands. In
order to mitigate the effects of fading, cooperative detection schemes
have been proposed in the literature 0H1E]. This means that a part of
the spectrum is simultaneously sensed by multiple SUs that send their
local test statistics to a fusion center (FC), which then makes a global
decision about the state of the spectrum. With such cooperation, the
probability of detection at a given signal-to-noise ratio (SNR) is
increased or, for equal performance, simpler detector structures may be
employed.

Figure 2: A cognitive radio setting. The SUs are collaboratively sensing
whether the PUs are active or not. After sensing the spectrum the SUs
send their sensing results to a common fusion center (FC) that makes a
global decision about the state of the spectrum and grants access to the
spectrum for one of the users if the spectrum is found unoccupied.
Cooperative spectrum sensing provides spatial diversity to overcome the
effects of slow fading caused by large objects and fast fading caused by
multi-path propagation and mobility.

An important function performed by the FC is the spectrum sensing
policy, which is also the focus of this paper. A spectrum sensing policy
tells the SUs who senses which part of the spectrum, and when. One of
the main targets of a sensing policy is to select for sensing those
frequency bands that persistently provide more spectrum opportunities
and throughput for the SU network.

1.1. Contribution of the paper 

In this paper a reinforcement learning based multi-user, multi-band
spectrum sensing policy is proposed. The proposed sensing policy
balances between exploring and exploiting different parts of the radio
spectrum and different sensing assignments. It decides which frequency
bands to sense as well as which SU is assigned to do the sensing. In the
exploitation phase the sensing assignment for the high throughput
subbands is found by minimizing the number of assigned SUs subject to a
constraint on the miss detection probability. Moreover, the probability
of false alarm is constrained by using Neyman-Pearson detectors.
Minimizing the number of simultaneously sensing SUs improves the energy
efficiency of the battery operated SUs. The minimization is formulated
as a binary integer programming (BIP) problem that may be solved exactly
by a branch-and-bound type algorithm or approximately by heuristic
methods such as the iterative Hungarian method considered in this paper.
The proposed policy may reduce the number of active sensors by up to a
factor of $D$, where $D$ is the diversity order of a fixed sensing
policy. In the exploration phase different pseudorandom sensing
assignments with fixed diversity order are explored in order to re-adapt
to possible changes in the PU activity and channel conditions. On the
one hand, spatial diversity improves the detector performance in the
face of fading and shadowing; on the other hand, it reduces the number
of frequency bands that the secondary network can sense simultaneously.
A cognitive network may use multiple idle frequency bands in order to
improve the rate or reliability of the network.

Some preliminary ideas and results related to this paper 
were presented in [B]. The contributions of this paper are: 

• We propose a machine learning based spectrum sensing policy for
cognitive radio that:

— provides high throughput for the SUs, 

— reduces missed detections, 

— is energy efficient, 

— is adaptive to non-stationary PU behavior and 
channel conditions. 

• Analytical expressions for the convergence of the proposed sensing
policy in stationary scenarios are derived.

• Extensive simulation results highlighting the excellent 
performance of the proposed sensing policy in various 
stationary and non-stationary scenarios are shown. 

• We show that a simple and fast approximate algorithm based on the
Hungarian method may be used to find near optimal sensing assignments.

The main difference between this paper and the related work in the
literature, in addition to the methodology, is that information about
the sensing performance of the individual SUs is exploited to optimize
the sensing assignments in an energy efficient manner.

This paper is organized as follows. In Section 2 the work related to
this paper is briefly summarized. The system model of cooperative
multi-band sensing is described in Section 3. In Section 4 an energy
efficient reinforcement learning based sensing policy is proposed and
analytical results on the convergence rate of the Q-values in the
sensing policy are derived. Section 5 presents and discusses simulation
results on the performance of the proposed sensing policy. The paper is
concluded in Section 6.

2. Related work 

The task of choosing which frequency band to sense may be formulated as
a restless multi-armed bandit (RMAB) problem. In RMAB problems a player
bets on $L$ out of $N$ slot machines ($L \geq 1$, $N \geq L$), aiming to
maximize its long term profit. The term restless comes from the fact
that the states of the non-played machines may also change, just as the
state of the frequency bands that are not sensed may change in a CR
setting. In [3], [7]-[13] spectrum sensing policies are derived based on
the framework of partially observable Markov decision processes
(POMDPs). In [13] a closed form Whittle index policy for perfectly known
Markovian reward distributions was derived and shown to be optimal under
certain conditions.
In a case where the player does not have prior knowledge about the
reward distributions of the different machines (or, as in this case,
about the throughputs of the different frequency bands), it is obviously
impossible to derive optimal action selection policies. In such a case
machine learning is an attractive approach for solving the problem. A
known issue with machine learning methods is the so-called
exploitation-exploration trade-off, which emerges when the player has to
decide whether to exploit the seemingly best machine (or frequency band)
at the moment or to explore other machines in the hope of finding an
even better one. A standard method for tackling multi-armed bandit
problems is the Q-learning algorithm P3] with $\epsilon$-greedy
exploration [15]. An alternative way of balancing the trade-off between
exploration and exploitation is to use confidence bounds. Namely, in
[16] a simple policy based on upper confidence bounds (UCB) was proposed
and shown to reach the optimal regret rate when the rewards are
independent and stationary. A UCB policy that is better suited to
non-stationary rewards was developed in [17]. In [18] a single-user
reinforcement learning method was proposed for selecting between 3
future actions: continuing sensing at the current frequency band $b$ and
transmitting data, sensing an out-of-band frequency band, and switching
the SU system to an out-of-band frequency band. Action selection is done
using the softmax method.

3. System model 

The SU network consists of $N_S$ cooperating wireless SU terminals
sensing the radio spectrum. The spectrum of interest is assumed to be
divided into $N_B$ frequency subbands that may have different bandwidths
and may be occupied by different primary operators. The subbands may be
scattered in frequency. Depending on the front-end design of the SU
device, one SU can sense up to $K_s$ subbands at a time.

In this paper it is assumed that the SUs cooperate by sending their
local binary decisions to a FC that makes a global decision about the
availability of the spectrum for all SUs. This brings spatial diversity
and increased scanning speed. Spatial diversity is obtained when
multiple SUs sense the same part of the spectrum simultaneously from
different locations and then form a global decision. Scanning speed is
increased since each SU may get sensing information about up to
$\sum_s K_s$ subbands simultaneously.

The SUs are assumed to be synchronized and their operation to be divided
into sensing mini time slots and potential transmission slots, as
illustrated in Figure 3. In a sensing time slot the SU senses up to
$K_s$ subbands and then sends its local binary decision(s) to the FC via
a dedicated control channel. The global decisions about the state of the
sensed subbands are formed at the FC by combining the local binary
decisions according to a fusion rule.

The FC may be a dedicated node, or one or multiple nodes could serve as
a FC in an ad hoc scenario. A dedicated FC makes a global decision on
behalf of all other SUs, whereas individual FCs in an ad hoc scenario
could make independent decisions based on their own test statistics and
the test statistics received from other SUs.

Figure 3: Time slotting of the SUs' operation. After sensing a
particular subband the SUs send their local test statistics or decisions
to the FC that makes the global decision about the state of the spectrum
(sensing mini slot). Finally the FC grants permissions to transmit on
the frequencies that were found to be idle (transmission slot).

One proposed approach to model the PU activity is the two-state Markov
chain shown in Figure 4 [19]. In the model, state 0 means that the
primary subband is idle (PU not transmitting) and state 1 that the
subband is occupied (PU transmitting). However, the policy proposed in
this paper is not limited to the Markovian assumption. The Markov model
is merely used for illustration purposes in the experimental part of
this paper.
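The two-state occupancy model can be illustrated with a minimal Python sketch. The transition probability names `p01` (idle to occupied) and `p10` (occupied to idle), their numerical values, and the function names are assumptions made for this illustration only; they are not taken from the paper.

```python
import random


def simulate_pu_occupancy(p01, p10, n_slots, seed=0, initial_state=0):
    """Simulate a two-state Markov chain: 0 = idle, 1 = occupied.

    p01 and p10 are hypothetical names for the idle->occupied and
    occupied->idle transition probabilities.
    """
    rng = random.Random(seed)
    state = initial_state
    states = []
    for _ in range(n_slots):
        states.append(state)
        if state == 0:
            state = 1 if rng.random() < p01 else 0
        else:
            state = 0 if rng.random() < p10 else 1
    return states


def stationary_idle_probability(p01, p10):
    """Long-run fraction of idle slots (balance: pi0 * p01 = pi1 * p10)."""
    return p10 / (p01 + p10)


# Illustrative values only.
occupancy = simulate_pu_occupancy(p01=0.1, p10=0.3, n_slots=100_000)
idle_fraction = occupancy.count(0) / len(occupancy)
```

For a long simulated trace the empirical idle fraction should be close to the stationary value returned by `stationary_idle_probability`.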




Figure 4: The Gilbert-Elliot channel model [19]. In this paper state 0
means that the subband is idle and state 1 that the subband is occupied
by a PU.



4. Reinforcement learning based sensing policy 

In the PU network, as in most communication systems, the traffic load
may vary depending on time and location. The expected amount of
available radio spectrum for opportunistic secondary use may, for
example, be much less during rush hours and in densely populated areas
than during night time and in rural areas. Also the radio channel
conditions fluctuate in time depending on location, velocity and
frequency. Hence, the design of a sensing policy for CR has to be
approached as a dynamic problem.

4.1. The $\epsilon$-greedy method

Let $Q_k(a)$ denote the estimated value of action $a$ at time step $k$
and $a^*_k$ denote the selected action at time step $k$. The
$\epsilon$-greedy policy is an ad hoc method that balances between
exploration and exploitation by selecting the action that has the
highest estimated action value, i.e. $a^*_k = \arg\max_a Q_k(a)$, with
probability $1 - \epsilon$, or a random action, drawn uniformly, with
probability $\epsilon$ regardless of the action-value estimates [15].

The $\epsilon$-greedy method is a simple and robust method that has
minor computation and memory requirements. The random exploration phase
allows the random action selections to be replaced with more carefully
designed pseudorandom action selections with desired properties, which
are described in detail in Section 4.2.1.
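The basic $\epsilon$-greedy selection rule can be sketched in a few lines of Python. This is a minimal illustration with purely random exploration; the function name and the example Q-values are assumptions, and the pseudorandom exploration used by the proposed policy is not modeled here.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return the index of the selected action.

    With probability epsilon a uniformly random action is explored;
    otherwise the action with the highest Q-value is exploited.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


# Illustrative use with hypothetical Q-values.
rng = random.Random(1)
action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.1, rng=rng)
```

Setting `epsilon=0` makes the rule purely greedy, while `epsilon=1` makes it purely random.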

After taking action $a$, a reward $r(a)$ is collected, after which the
Q-value of action $a$ is updated as [15]

$$Q_{k+1}(a) = Q_k(a) + \alpha_k \left[ r_{k+1}(a) - Q_k(a) \right], \tag{1}$$

where $r_{k+1}(a)$ is the reward at time step $k+1$ for taking action
$a$ and $\alpha_k$ ($0 < \alpha_k \leq 1$) is a step size parameter.

In a stationary scenario convergence is guaranteed with probability 1
when the step size parameter $\alpha_k$ satisfies the following
conditions [15]

$$\sum_{k=1}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty. \tag{2}$$



The first condition in (2) guarantees that the step size is large enough
to overcome the initial conditions, while the second condition
guarantees that the step size is small enough to assure eventual
convergence. The step size $\alpha_k = 1/(k+1)$ fulfills the conditions
of (2) and results in the standard sample average of the past rewards.
On the other hand, for constant $\alpha_k = \alpha$ the estimates will
never completely converge, but continue varying in response to the
latest observed rewards. When tracking a non-stationary process this is
in fact desirable, since the policy should react rapidly to changes in
the subband occupancy statistics. A constant $\alpha_k = \alpha$ results
in a weighted average of the observed rewards, i.e. [15]

$$Q_{k+1}(a) = (1-\alpha)^{k+1} Q_0(a) + \sum_{i=1}^{k+1} \alpha (1-\alpha)^{k+1-i} r_i(a). \tag{3}$$



A constant step size $\alpha$ is suitable for tracking non-stationary
processes such as the channel qualities in CR networks. It can be seen
from (3) that when $\alpha$ is large more emphasis is given to the most
recent rewards, whereas when $\alpha$ is close to 0 the algorithm gives
emphasis to rewards obtained in the more distant past as well. This
suggests that for heavily non-stationary processes large values of
$\alpha$ would be more suitable, whereas for stationary processes a
small $\alpha$ would give better results.
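The relation between the recursive update (1) and the weighted average (3) can be checked numerically. The following Python sketch, with hypothetical function names and an arbitrary reward sequence, applies the update with a constant step size and compares it with the closed form; it also shows that the step size $1/(k+1)$ yields the sample average.

```python
def q_update(q, reward, alpha):
    """One step of the update Q <- Q + alpha * (r - Q), as in (1)."""
    return q + alpha * (reward - q)


def run_constant_step(q0, rewards, alpha):
    """Apply the update repeatedly with a constant step size."""
    q = q0
    for r in rewards:
        q = q_update(q, r, alpha)
    return q


def weighted_average(q0, rewards, alpha):
    """Closed form (3): exponentially weighted average of the rewards."""
    n = len(rewards)
    total = (1 - alpha) ** n * q0
    for i, r in enumerate(rewards, start=1):
        total += alpha * (1 - alpha) ** (n - i) * r
    return total


def run_sample_average(rewards):
    """Step size 1/(k+1), k = 0, 1, ..., gives the plain sample average."""
    q = 0.0
    for k, r in enumerate(rewards):
        q = q_update(q, r, 1.0 / (k + 1))
    return q


rewards = [1.0, 0.0, 1.0, 1.0]  # arbitrary illustrative rewards
recursive = run_constant_step(0.5, rewards, alpha=0.1)
closed_form = weighted_average(0.5, rewards, alpha=0.1)
```

Both `recursive` and `closed_form` evaluate the same quantity, so they agree up to floating point rounding.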

4.2. The proposed sensing policy

In this paper we propose a sensing policy using $\epsilon$-greedy
exploration for selecting the frequency subbands to be sensed and for
selecting the corresponding sensing assignments in a CR network. The
policy is managed by the FC, which tracks two kinds of Q-values: the
Q-values of the subbands and the Q-values of each SU for each subband.
A natural way to define the reward $r_{k+1}(b)$ for selecting subband
$b$ to be sensed is the obtained throughput:

$$r_{k+1}(b) = \begin{cases} R_{k+1}(b), & \text{if } b \text{ is accessed and free} \\ 0, & \text{if } b \text{ is occupied,} \end{cases} \tag{4}$$

where $R_{k+1}(b)$ is the instantaneous throughput on subband $b$. In
this paper it is assumed that the SU that has been granted permission to
access the band will feed back an estimate of the achieved throughput.
For example, this may be an estimate based on the measured channel
quality between the communicating SUs. Using this feedback the FC
updates the Q-values of each subband according to (1).

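The SU Q-values for particular subbands are updated by comparing the
SUs' decisions to the global decision:

$$r_{k+1}(s,b) = \begin{cases} d_{k+1}(s,b), & \text{if } d_{k+1}(\mathrm{FC},b) = 1 \\ Q_k(s,b), & \text{if } d_{k+1}(\mathrm{FC},b) = 0, \end{cases} \tag{5}$$

where $d_{k+1}(s,b)$ denotes the local decision by SU $s$ for subband
$b$ at time instant $k+1$ and $d_{k+1}(\mathrm{FC},b)$ denotes the
corresponding decision at the FC. Here a decision equal to 1 indicates
that the subband is declared occupied; when the FC declares the subband
free, the reward equals the current Q-value and the update leaves the
Q-value unchanged. The SU's Q-value is then updated again according to
(1). Hence, the SU's Q-value indicates its sensing performance at
subband $b$, assuming that the global decision made at the FC based on
the local decisions from multiple SUs is correct.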
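The SU Q-value update can be sketched as follows. The sketch assumes, consistently with the derivation in Section 4.3, that the reward equals the SU's local decision when the FC declares the subband occupied (decision 1) and equals the current Q-value otherwise, so that the Q-value stays unchanged in the latter case. The function names are hypothetical.

```python
def su_reward(local_decision, fc_decision, q_prev):
    """Reward for an SU: its own decision if the FC decided 'occupied' (1),
    otherwise the previous Q-value (which leaves the Q-value unchanged)."""
    return float(local_decision) if fc_decision == 1 else q_prev


def update_su_q(q_prev, local_decision, fc_decision, alpha):
    """Apply the Q-value update with the SU reward defined above."""
    reward = su_reward(local_decision, fc_decision, q_prev)
    return q_prev + alpha * (reward - q_prev)


# Illustrative values: the SU agreed with an 'occupied' decision at the FC.
q_new = update_su_q(q_prev=0.5, local_decision=1, fc_decision=1, alpha=0.1)
```

With this definition the Q-value moves toward 1 when the SU agrees with an "occupied" decision at the FC and toward 0 when it misses it.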

After all the Q-value updates, with probability $1 - \epsilon$ the FC
exploits its knowledge and selects for sensing the $L$ subbands that
have the highest Q-values (stage 1 in Figure 5). In this paper it is
assumed that the FC has an estimate of the desired throughput and is
able to select the parameter $L$ appropriately. After selecting the
subbands the FC finds an appropriate sensing assignment for them (stage
2 in Figure 5). With probability $\epsilon$ the sensing is done
according to predefined pseudorandom frequency hopping codes with a
fixed diversity order $D$, where $D$ is the number of SUs simultaneously
sensing the same subband. In the exploitation phase the sensing
assignment is the one that minimizes the number of sensings in the SU
network while maintaining the detection performance at a desired level.
Finally, the FC informs the SUs which subbands they should sense.
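The first stage of the policy at the FC can be sketched as below. This is only an illustration of the explore/exploit decision and the selection of the $L$ best subbands; the function name, the return convention and the example Q-values are assumptions, and the sensing assignment of the second stage is not included.

```python
import random


def fc_select(subband_q, n_sense, epsilon, rng):
    """Decide between exploration and exploitation at the FC.

    Returns ("explore", None) with probability epsilon, in which case the
    pseudorandom hopping codes would be used, or ("exploit", subbands)
    with the indices of the n_sense subbands having the highest Q-values.
    """
    if rng.random() < epsilon:
        return "explore", None
    ranked = sorted(range(len(subband_q)),
                    key=lambda b: subband_q[b], reverse=True)
    return "exploit", sorted(ranked[:n_sense])


# Illustrative use with hypothetical subband Q-values.
mode, subbands = fc_select([0.2, 0.9, 0.5, 0.7], n_sense=2,
                           epsilon=0.1, rng=random.Random(0))
```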

4.2.1. Exploration

In this section the pseudorandom frequency hopping based sensing policy
proposed in [20] is briefly summarized, since it constitutes the
exploration phase of the sensing policy developed in this paper. The
pseudorandom frequency hopping based sensing policy provides quick
scanning of the spectrum of interest with minimal control signaling,
which makes it well suited for exploring the spectrum. The frequency
hopping code design allows scanning speed to be traded off against
diversity (and consequently detector performance). Moreover, by
guaranteeing the desired diversity order $D$, reliable performance is
ensured in demanding propagation environments. In the pseudorandom
frequency hopping based multi-band spectrum sensing policy the design of
the sensing policy is converted into designing and allocating
pseudorandom frequency hopping codes to the SUs, telling them which
subbands to sense and when. After each hopping code period different
$D$-tuples of the $N_S$ SUs will be employed to scan the spectrum of
interest together. The design is made such that over time all possible
SU combinations of size $D$ will be employed to sense each subband.
Figure 6 shows an example design of the hopping codes for $N_S = 4$,
$N_B = 3$ and $D = 2$.

Figure 5: Flow diagram of the operation of the SU network and the FC.
The blocks concerning the proposed sensing policy are highlighted with
shading. In the diagram rand is uniformly distributed between 0 and 1.

Figure 6: Pseudorandom frequency hopping codes for $N_S = 4$, $N_B = 3$
and $D = 2$. At each sensing instance the SU senses the subband pointed
to by the current entry of its hopping code. Hopping code entries point
to the physical frequencies that are maintained in a lookup table. The
possible transmission slots following each sensing instance have been
left out for convenience.

In frequency hopping based sensing each SU hops according to its hopping
sequence to sense one of the subbands of interest. The subband to be
sensed at time index $i$ is given by $f(i) = F[S_q(i)]$, where $S_q(i)$
is the $q$th frequency hopping sequence and $F$ is a table containing
the mappings to the physical subbands. Table $F$ may include links to
the subbands' center frequencies and bandwidths. It is assumed that $F$
is the same for all SUs in the network.

Since it is desirable to scan as much spectrum as possible at once, the
hopping sequences are made orthogonal. The simplest way to generate an
orthogonal code family is to cyclically shift any full sequence of
integer numbers. A full sequence is a sequence that contains all integer
numbers up to a certain number. Cyclic shifts may be generated by the
modulo operation as

$$S_q(i) = (i + \Delta_q) \bmod N_B, \tag{6}$$

where $i \in [0, N_B - 1]$, $q \in [0, J - 1]$ and $\Delta_q$ is the
shift parameter. For more information about the choice of $\Delta_q$ and
the design of the frequency hopping sequences as well as simulation
results see [20].
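The cyclic-shift construction of (6) can be illustrated with a short Python sketch. The shift values, the contents of the lookup table and the function names are hypothetical; the actual choice of the shift parameters follows the design in [20] and is not reproduced here.

```python
def hopping_sequence(shift, n_bands):
    """Cyclically shifted full sequence: S(i) = (i + shift) mod n_bands."""
    return [(i + shift) % n_bands for i in range(n_bands)]


def sensed_subband(table, shift, i):
    """Physical subband f(i) = F[S(i)] looked up from the table F."""
    return table[(i + shift) % len(table)]


# Hypothetical lookup table F for three subbands.
F = ["subband A", "subband B", "subband C"]

# Three codes with distinct shifts; at every time index they point to
# three different subbands, i.e. the codes are orthogonal.
codes = [hopping_sequence(shift, len(F)) for shift in (0, 1, 2)]
```

Because distinct shifts modulo the number of subbands never coincide at the same time index, SUs using different shifts always sense different subbands at a given sensing instance.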

4.2.2. Exploitation

In many practical scenarios the cooperating SUs, although in the
vicinity of each other, may experience very different channel conditions
due to fading. The cooperation among the SUs can then be optimized in
order to save the energy of the SUs.

Assume that the secondary network of $N_S$ SUs wants to sense
$L \leq N_B$ subbands in the hope of finding spectral opportunities.
These subbands have been selected in the first stage of the sensing
policy as the ones that are most likely to produce a high reward
(throughput) for the SU network. Denote the set of all the chosen $L$
subband indices as $B$ and the set of all SU indices as $S$.
Furthermore, assume that the SU network has knowledge of the SUs'
probabilities of detection $P_{sb}$, where $s \in S$ and $b \in B$. In
order to conserve the SUs' energy, we would like to minimize the number
of SUs assigned for sensing while guaranteeing a desired level of
detection performance at the subbands of interest. Hence, the sensing
assignment problem (SAP) can be formulated as

$$\begin{aligned} \min_{X} \quad & \sum_{b \in B} \sum_{s \in S} w_s x_{sb} \\ \text{s.t.} \quad & P^b_{\text{miss,FC}}(X) \leq P^b_{\text{miss,target}} \\ & \sum_{b \in B} x_{sb} \leq K_s \\ & x_{sb} \in \{0, 1\}, \end{aligned} \tag{7}$$

where $K_s$ is a positive integer corresponding to the number of
subbands SU $s$ can sense simultaneously and $w_s$ is the weight of user
$s$. $X = [x_{sb}]$ is the unknown $N_S \times L$ binary sensing
assignment matrix. The elements of $X$ are

$$x_{sb} = \begin{cases} 1, & \text{if SU } s \text{ is assigned to sense subband } b \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$

In equation (7), $P^b_{\text{miss,FC}}(X)$ is the estimate of the miss
detection probability at the FC at subband $b$ obtained with sensing
assignment $X$, and $P^b_{\text{miss,target}}$ is the maximum
probability of miss detection that the secondary network is allowed to
have at band $b$. The first constraint in (7) requires that the
probability of miss detection at the FC does not exceed the target,
whereas the second constraint restricts the number of subbands SU $s$
can sense simultaneously to $K_s$ or less. The weight $w_s$ of SU $s$
may be chosen, for example, according to the SU's battery charge. If SU
$s$ is known to have low battery charge it may be given a relatively
large weight compared to other users so that it is unlikely to be
assigned for sensing.

There are many ways to design distributed detection such that the
detection performance constraint in (7) is met. As an example we
consider here hard decision combining of multiple Neyman-Pearson
detectors [21]. Neyman-Pearson detectors maximize the detection
probability under a constraint on the false alarm rate, and hence the
false alarm rate is not included in (7) as a separate constraint.
Typically the false alarm rate constraint is set small since false
alarms amount to overlooked spectral opportunities. For hard decision
combining, such as the OR-rule considered in the next subsection, the
false alarm rate constraint at the FC is met simply by controlling the
local detection thresholds according to the number of SUs assigned to
sense the same subband [21].
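One way to see how the local thresholds depend on the number of sensing SUs is the following Python sketch for the OR-rule. It assumes that all assigned SUs use the same local false alarm probability and that their observations are independent when the subband is free; under these assumptions the false alarm probability at the FC is $1 - (1 - P_{f,\text{local}})^n$ for $n$ sensing SUs. The function names are hypothetical and the exact threshold design of [21] is not reproduced.

```python
def fc_false_alarm(pf_local, n_sensors):
    """False alarm probability at the FC for the OR-rule, assuming
    n_sensors independent SUs with equal local false alarm probability."""
    return 1.0 - (1.0 - pf_local) ** n_sensors


def local_false_alarm(pf_fc, n_sensors):
    """Local false alarm probability that meets a target pf_fc at the FC
    under the same assumptions (inverse of fc_false_alarm)."""
    return 1.0 - (1.0 - pf_fc) ** (1.0 / n_sensors)


# Illustrative target: 1 % false alarm rate at the FC with 4 sensing SUs.
pf_local = local_false_alarm(0.01, 4)
```

The more SUs are assigned to a subband, the smaller the local false alarm probability each of them must use in order to keep the false alarm rate at the FC fixed.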

4.2.3. Sensing assignment for the OR-fusion rule

Next the sensing assignment is illustrated for the OR-rule, where the
SUs send only their local decisions to the FC. The FC then decides a
subband to be free only if all sensing SUs have reported it to be free.
Other fusion rules such as the K-out-of-N rule could be used as well.
Assuming conditional independence of the observations at different SUs
given $H_0$ or $H_1$, the probability of missed detection at the FC at
subband $b$ for the OR-rule is given by

$$P^b_{\text{miss,FC}} = \prod_{s=1}^{N_S} (1 - P_{sb})^{x_{sb}}, \tag{9}$$

which as such would lead to a nonlinear constraint in the SAP given by
equation (7). However, the detection performance constraint can be
linearized by simply taking the logarithm of the missed detection
probabilities, i.e.

$$\ln\left(P^b_{\text{miss,FC}}\right) = \sum_{s=1}^{N_S} \ln(1 - P_{sb})\, x_{sb}. \tag{10}$$
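The product form (9) and its logarithmic counterpart (10) can be compared numerically with a small Python sketch. The function names and the detection probabilities are illustrative assumptions; the sketch also assumes that every local detection probability is strictly below one so that the logarithm is defined.

```python
import math


def miss_prob_fc(p_det, assignment):
    """OR-rule miss probability at the FC: product of (1 - P_s) over the
    SUs that are assigned (x_s = 1) to the subband."""
    prob = 1.0
    for p, x in zip(p_det, assignment):
        if x:
            prob *= 1.0 - p
    return prob


def log_miss_prob_fc(p_det, assignment):
    """Linearized form: sum of x_s * ln(1 - P_s); requires P_s < 1."""
    return sum(x * math.log(1.0 - p) for p, x in zip(p_det, assignment))


# Hypothetical local detection probabilities of three SUs at one subband;
# the first two SUs are assigned to sense it.
p_det = [0.9, 0.8, 0.5]
x = [1, 1, 0]
p_miss = miss_prob_fc(p_det, x)
log_p_miss = log_miss_prob_fc(p_det, x)
```

Since the logarithm is linear in the binary variables, a target on the miss probability becomes a linear inequality on the assignment vector.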



Then the SAP for the OR-rule can be formulated as a linear binary
integer programming (BIP) problem as

$$\begin{aligned} \min_{x} \quad & w^T x \\ \text{s.t.} \quad & A x \leq c \\ & x \text{ is binary,} \end{aligned} \tag{11}$$

where $w$ is an $N_S L \times 1$ vector of weights for the SUs at
different subbands, $x = \mathrm{vec}(X)$ is a binary vector of size
$N_S L \times 1$, $A$ is the $(L + N_S) \times L N_S$ constraint matrix
containing the logarithms of the estimated local miss detection
probabilities $\ln(1 - P_{sb})$ and $L$ identity matrices at the bottom,
and $c$ is the vector of the constraints. The constraint vector is given
as $c = [\ln(P^1_{\text{miss,target}}), \ldots, \ln(P^L_{\text{miss,target}}), K_1, \ldots, K_{N_S}]^T$.
Since the detection probability can be known only up to a certain margin
of error, the constraint vector $c$ should in practice include a safety
margin defined by a spectrum regulator. The constraint matrix $A$ is
given by

$$A = \begin{bmatrix} p^1_{\text{miss}} & & \\ & \ddots & \\ & & p^L_{\text{miss}} \\ I_{N_S} & \cdots & I_{N_S} \end{bmatrix}, \tag{12}$$

where $p^b_{\text{miss}} = [\ln(1 - P_{1b}), \ln(1 - P_{2b}), \ldots, \ln(1 - P_{N_S b})]$
and $I_{N_S}$ is the identity matrix of size $N_S$.
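For very small instances the BIP can be solved by exhaustive enumeration, which makes the structure of $A$ and $c$ easy to inspect. The Python sketch below builds the constraints for the OR-rule and enumerates all binary vectors; it is not the branch-and-bound or Hungarian-based method discussed in the paper, and the function names and the example probabilities are assumptions.

```python
import itertools
import math


def build_constraints(p_det, miss_target, k_max):
    """Build A and c for x = vec(X), stacked subband by subband.

    p_det[s][b] is the detection probability of SU s at subband b (< 1),
    miss_target[b] the target miss probability, k_max[s] the limit K_s.
    """
    n_s, n_l = len(p_det), len(p_det[0])
    A = []
    for b in range(n_l):  # one detection constraint per subband
        row = [0.0] * (n_s * n_l)
        for s in range(n_s):
            row[b * n_s + s] = math.log(1.0 - p_det[s][b])
        A.append(row)
    for s in range(n_s):  # one capacity constraint per SU
        row = [0.0] * (n_s * n_l)
        for b in range(n_l):
            row[b * n_s + s] = 1.0
        A.append(row)
    c = [math.log(t) for t in miss_target] + [float(k) for k in k_max]
    return A, c


def solve_bip_bruteforce(w, A, c):
    """Return (cost, x) of a minimum-cost feasible binary x, or None."""
    best = None
    for x in itertools.product((0, 1), repeat=len(w)):
        feasible = all(
            sum(a * xi for a, xi in zip(row, x)) <= ci + 1e-12
            for row, ci in zip(A, c)
        )
        if feasible:
            cost = sum(wi * xi for wi, xi in zip(w, x))
            if best is None or cost < best[0]:
                best = (cost, x)
    return best


# Hypothetical example: 3 SUs, 2 subbands, target miss probability 0.1.
p_det = [[0.95, 0.60], [0.70, 0.92], [0.60, 0.70]]
A, c = build_constraints(p_det, miss_target=[0.1, 0.1], k_max=[1, 1, 1])
solution = solve_bip_bruteforce([1] * 6, A, c)
```

The enumeration visits $2^{N_S L}$ candidates, so it is only meant for checking small examples.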

This BIP problem is NP-hard but solvable by branch-and-bound (BB) type
algorithms. The worst case running time of BB search, although unlikely,
is $2^{N_S L}$, which corresponds to the case where no branching is
possible. In practice $N_S L$ may be assumed to be small.

In cases where the product $N_S L$ is large, the probability that there
exist multiple near optimal assignments is high. In such cases heuristic
approximation algorithms may be applied. In [22] an iterative Hungarian
algorithm is proposed to find a sensing assignment that minimizes the
probability of miss detection. The policy assigns SUs to sense the
subbands one by one using the Hungarian method [23]. In our problem
formulation, the Hungarian method can be employed iteratively, similarly
to [22], to find a near optimal solution for the SAP with $w_s = 1$ by
modifying the algorithm to stop immediately once a feasible solution is
found. Since the Hungarian algorithm runs in polynomial time, this
method also runs in polynomial time.

4.3. SU Q-value and the local detection probability

Solving the optimization problem of (7) requires estimates of the
probabilities of missed detection at the FC. Defining the reward as in
(5) simultaneously provides a simple estimate of the SUs' probabilities
of detection.

Since the SU Q-values are updated according to equation (1) similarly to
the subband Q-values, it can be shown that the asymptotic expected
Q-values $E[Q_k(s,b)]$ approach the expected reward as $k \to \infty$.
From equations (1) and (5), assuming that
$E[Q_{k+1}(s,b)] = E[Q_k(s,b)]$ as $k \to \infty$, we get

$$\lim_{k \to \infty} E[Q_{k+1}(s,b)] = \lim_{k \to \infty} E[r_k(s,b)] = \frac{P_1 P(d(\mathrm{FC}) = 1 \cap d(s) = 1 \mid H_1)}{P_1 P_{d,\mathrm{FC}} + P_0 P_{f,\mathrm{FC}}} + \frac{P_0 P(d(\mathrm{FC}) = 1 \cap d(s) = 1 \mid H_0)}{P_1 P_{d,\mathrm{FC}} + P_0 P_{f,\mathrm{FC}}},$$

where $P_0$ is the probability of the subband being free,
$P_1 = 1 - P_0$, $d(s)$ and $d(\mathrm{FC})$ are the decisions at SU $s$
and at the FC, respectively, and $P_{d,\mathrm{FC}}$ and
$P_{f,\mathrm{FC}}$ are, respectively, the probabilities of detection
and false alarm at the FC. For notational convenience the subband index
$b$ has been dropped.

For the OR-rule
$P(d(\mathrm{FC}) = 1 \cap d(s) = 1 \mid H_1) = P(d(s) = 1 \mid H_1) = P_{d,s}$
and
$P(d(\mathrm{FC}) = 1 \cap d(s) = 1 \mid H_0) = P(d(s) = 1 \mid H_0) = P_{f,s}$,
since $P(d(\mathrm{FC}) = 1 \mid d(s) = 1) = 1$. Then,

$$\lim_{k \to \infty} E[Q_{k+1}(s,b)] = \frac{P_1 P_{d,s} + P_0 P_{f,s}}{P_1 P_{d,\mathrm{FC}} + P_0 P_{f,\mathrm{FC}}} \approx \frac{P_{d,s}}{P_{d,\mathrm{FC}}},$$

assuming that $P_0 P_{f,s} \approx 0$ and
$P_0 P_{f,\mathrm{FC}} \approx 0$. It can be seen that for the local
detection probability estimates to be accurate, the detection
probability at the FC, $P_{d,\mathrm{FC}}$, should be close to one. This
can be achieved through spatial diversity if the decision at the FC is
based on multiple SUs' local test statistics or decisions.

Figure 7: The mean converged SU Q-values after 10000 sensing instances
ordered according to the corresponding mean SNR and the true probability
of detection curve. The mean was calculated over 100 runs. It can be
seen that the SUs' Q-values (red circles) coincide well with the true
probabilities of detection (blue curve).

Figure 7 shows converged SU Q-values ordered by mean SNR together with
the true probability of detection curve. The detection scheme is
Neyman-Pearson energy detection with a sample size of 50 and
$P_{f,\mathrm{FC}} = 0.01$. The fusion rule is the OR-rule with $D = 2$
and $\alpha = 0.1$. It can be seen that the Q-values align with the true
probabilities of detection.
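The asymptotic SU Q-value for the OR-rule can be evaluated directly from the expression derived above. The Python sketch below uses hypothetical local detection and false alarm probabilities for two cooperating SUs and assumes independent local decisions; it compares the exact ratio with the approximation $P_{d,s} / P_{d,\mathrm{FC}}$.

```python
def or_rule(probabilities):
    """Probability that at least one of several independent detectors
    decides 1, given the individual probabilities of deciding 1."""
    none = 1.0
    for p in probabilities:
        none *= 1.0 - p
    return 1.0 - none


def asymptotic_su_q(p0, pd_s, pf_s, pd_fc, pf_fc):
    """(P1*Pd_s + P0*Pf_s) / (P1*Pd_FC + P0*Pf_FC) for the OR-rule."""
    p1 = 1.0 - p0
    return (p1 * pd_s + p0 * pf_s) / (p1 * pd_fc + p0 * pf_fc)


# Hypothetical values: two SUs with detection probabilities 0.9 and 0.8
# and equal local false alarm probabilities 0.005; subband free half
# of the time.
pd_fc = or_rule([0.9, 0.8])
pf_fc = or_rule([0.005, 0.005])
q_limit = asymptotic_su_q(0.5, 0.9, 0.005, pd_fc, pf_fc)
approximation = 0.9 / pd_fc
```

With a detection probability at the FC close to one and small false alarm probabilities, `q_limit` stays close to the local detection probability of the SU.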

4.4. Convergence of the subband Q-values

Since all the subbands are not necessarily sensed all the time, we need
to introduce another time variable $T_k(b) \leq k$ denoting the number
of sensing instances (and value updates) at band $b$ up to the $k$th run
of the $\epsilon$-greedy algorithm. Regrouping the components in
equation (1), the Q-value of subband $b$ can be expressed as

$$Q_{T_k(b)+1}(b) = (1 - \alpha) Q_{T_k(b)}(b) + \alpha r_k(b).$$

Taking the expectation of both sides results in

$$E[Q_{T_k(b)+1}(b)] = (1 - \alpha) E[Q_{T_k(b)}(b)] + \alpha \mu(b),$$

where $\mu(b) = E[r_k(b)]$. This is a linear recurrence, whose solution
is given by

$$E[Q_{T_k(b)+1}(b)] = (1 - \alpha)^{T_k(b)+1} E[Q_0(b)] + \left(1 - (1 - \alpha)^{T_k(b)+1}\right) \mu(b).$$

Assuming $E[Q_0(b)] = 0$, the expected Q-value of band $b$ at the
$T_k(b)$th update is

$$E[Q_{T_k(b)}(b)] = \left(1 - (1 - \alpha)^{T_k(b)}\right) \mu(b) \to \mu(b)$$

as $T_k(b) \to \infty$. Then the expected Q-value of band $b$ after the
$k$th run of the $\epsilon$-greedy algorithm is given by

$$E[Q_k(b)] = \mu(b) \sum_{T_k(b)=0}^{k} P(T_k(b)) \left(1 - (1 - \alpha)^{T_k(b)}\right), \tag{13}$$

where $P(T_k(b))$ is the probability that band $b$ has been updated
$T_k(b)$ times within the $k$ runs of the $\epsilon$-greedy algorithm.

Figure 8: The simulated convergence of the expected Q-value (black solid
curves) of 5 subbands and the upper (red dashed curve) and lower bounds
(blue dashed curve) for them. The number of sensed bands is set to
$L = 1$. The rewards are assumed to be Bernoulli distributed with
probability 0.5 and means $\mu(1) = \mu(2) = \mu(3) = \mu(4) = 1$ and
$\mu(5) = 10$. The used parameters are $\epsilon = 0.1$ and the step
size $\alpha = 0.1$. It can be noticed that the convergence of the
Q-values of the subbands with the lowest mean rewards follows closely
the lower bound, whereas the convergence at the best subband goes closer
to the upper bound.

Upper and lower bounds can be easily obtained for
$P(T_k(b))$ in a stationary case:

$$B\left(T_k(b), k, \epsilon \frac{L}{N_B}\right) \le P(T_k(b)) \le B\left(T_k(b), k, 1 - \epsilon\left(1 - \frac{L}{N_B}\right)\right),$$

where $B(T_k(b), k, p) = \binom{k}{T_k(b)} p^{T_k(b)} (1-p)^{k-T_k(b)}$ is the
binomial probability mass function. The lower bound corresponds
to the probability that the Q-value is updated
only in the exploration phase and the upper bound to the
case that in the exploitation phase the Q-value is updated
with probability one.
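
As a hedged illustration of how such bounds translate into bounds on the expected Q-value, the sketch below evaluates the sum in equation (13) with a binomially distributed update count. It assumes that the per-run update probability is $\epsilon L / N_B$ when the band is updated only during exploration and $1 - \epsilon(1 - L/N_B)$ when it is always updated during exploitation. This reading of the bounds, the function names and the chosen horizon `k` are assumptions made for this example.

```python
from math import comb


def binom_pmf(t, k, p):
    """Binomial probability of t successes in k trials."""
    return comb(k, t) * p**t * (1.0 - p) ** (k - t)


def expected_q(mu, alpha, k, p):
    """mu * sum_t B(t, k, p) * (1 - (1 - alpha)**t): equation (13) with a
    binomial update count of per-run update probability p."""
    return mu * sum(binom_pmf(t, k, p) * (1.0 - (1.0 - alpha) ** t)
                    for t in range(k + 1))


def expected_q_closed(mu, alpha, k, p):
    """Same quantity via the binomial theorem: mu * (1 - (1 - alpha*p)**k)."""
    return mu * (1.0 - (1.0 - alpha * p) ** k)


# Values from the Figure 8 setup; k = 100 runs is an arbitrary choice.
eps, alpha, L, n_bands, k = 0.1, 0.1, 1, 5, 100
p_low = eps * L / n_bands                  # updated only while exploring (assumed)
p_high = 1.0 - eps * (1.0 - L / n_bands)   # always updated while exploiting (assumed)
lower = expected_q(10.0, alpha, k, p_low)
upper = expected_q(10.0, alpha, k, p_high)
print(lower, upper)
```

The closed form shows that, under a binomial update count, the expected Q-value behaves like a single recursion with the effective step size $\alpha p$.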

The analysis for the convergence of the SU Q-values is
almost identical to the analysis above for the Q-values
of the subbands. The probability $P(T_k(b, s))$ that SU $s$
has sensed subband $b$ exactly $T_k(b, s)$ times during $k$ runs
is then bounded as



B(T k (b, s),k,e 



L 

W B 



< P(T k (b, a)) < B(T k (b, s), k, i- e (l-j^)). 



Establishing a lower bound for the probability of the num- 
ber of sensings is important for guaranteeing a desired con- 
vergence rate for the estimates of the probability of missed 
detections in the second stage of the proposed sensing pol- 
icy. 

Figure 8 shows the simulated convergence of the expected
Q-values of 5 subbands and the upper and lower
bounds for them. The number of sensed bands is set to
$L = 1$. The rewards are assumed to be Bernoulli distributed
with probability 0.5 and means $\mu(1) = \mu(2) = \mu(3) = \mu(4) = 1$
and $\mu(5) = 10$.
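
A single realization of this experiment can be sketched as follows. The sketch makes several assumptions that are not stated in the text: exploration picks a band uniformly at random (instead of using pseudorandom frequency hopping codes), a reward with mean $\mu$ is modeled as $2\mu$ with probability 0.5 and zero otherwise, and the function name and seed are illustrative.

```python
import random


def epsilon_greedy_run(means, n_runs, epsilon=0.1, alpha=0.1, seed=0):
    """One realization of epsilon-greedy band selection with L = 1."""
    rng = random.Random(seed)
    q = [0.0] * len(means)        # Q-values, initialized to zero
    counts = [0] * len(means)     # number of updates per band
    for _ in range(n_runs):
        if rng.random() < epsilon:
            b = rng.randrange(len(means))                   # exploration
        else:
            b = max(range(len(means)), key=q.__getitem__)   # exploitation
        # Assumed reward model: 2*mean with probability 0.5, otherwise 0.
        r = 2.0 * means[b] if rng.random() < 0.5 else 0.0
        q[b] = (1.0 - alpha) * q[b] + alpha * r             # constant step size
        counts[b] += 1
    return q, counts


q_values, update_counts = epsilon_greedy_run([1, 1, 1, 1, 10], n_runs=2000)
print(q_values)
print(update_counts)
```

Averaging the Q-values over many seeds would approximate the expected Q-value curves; a single run only shows one noisy trajectory.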



5. Simulation examples 

In this section simulation results for the proposed sens- 
ing policy are shown. The main focus is put on the ob- 
tained throughput of the secondary network and miss de- 
tection probability. 

5.1. Stationary case 

This subsection provides the results for a stationary scenario
in which the occupancy statistics of the primary
bands stay constant during the whole simulation period.
The results are shown for the throughput, average miss
detection probability and relative number of sensings in
the SU network with different values of $\epsilon$. For comparison,
the simulations are run using both the exact BB search
and an approximate iterative Hungarian
(IH) method adapted from [22]. In the stationary case the
mean detection performances of the SUs remain constant.
The simulations are done for $N_S = 6$ SUs and $N_B = 10$
primary subbands. The availability of each subband is
modeled according to a two-state Markov chain (see Figure 4)
with state transition probabilities $P_{11} = P_{00} = 0.9$.
Different subbands are assumed to be independent of each
other. The mean SNRs of the primary signal in the secondary
network are assumed to be distributed according to
the log-normal shadowing model with a standard deviation
of 9 dB. The fast fading component of the channel is
modeled as a block fading Rayleigh channel with an expected
power gain of 1. Furthermore, it is assumed that 3 of the
subbands are able to provide 10 times higher throughputs
on average. For spectrum sensing, Neyman-Pearson energy
detection with a sample size of 50 is used. The global decisions
at the FC are formed using the hard decision OR-rule
with a constant false alarm rate $P_{f,FC} = 0.01$. In the
exploration phase the pseudorandom frequency hopping code
design is made using a fixed diversity order $D = 2$ that has
been selected such that on average the miss detection
probability is close to $P_{\text{miss,target}}$. In the exploitation
phase the number of subbands each SU can sense simultaneously
is set to $K_s = 1, \forall s \in S$, and the target probability
of miss detection at the subbands is $P_{\text{miss,target}}(b) = 0.1$.
The weights $w_s$ in the SAP have been set to 1 for all SUs.
The number of subbands that the SU network wants to
find is constant during the whole simulation, i.e., $L = 3$.
For clarity, in this section the step sizes in the first and
second stages of the sensing policy are denoted as $\alpha_1$ and $\alpha_2$,
respectively. In the simulations $\alpha_1 = 0.01$ and $\alpha_2 = 0.1$.
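
To give a feel for what these step sizes imply, the sketch below computes how many updates a constant step size recursion needs before the expected Q-value has covered a given fraction of the distance from its initial value to the mean reward, using the geometric decay $(1-\alpha)^n$ of the initial value. The function name and the 90% level are illustrative assumptions.

```python
import math


def updates_to_close_gap(alpha, fraction=0.9):
    """Smallest n such that 1 - (1 - alpha)**n >= fraction."""
    return math.ceil(math.log(1.0 - fraction) / math.log(1.0 - alpha))


print(updates_to_close_gap(0.01))  # step size of the first stage
print(updates_to_close_gap(0.1))   # step size of the second stage
```

With these values roughly 230 updates are needed for $\alpha_1 = 0.01$ and 22 updates for $\alpha_2 = 0.1$. A smaller step size averages over a longer history, which gives smoother estimates but slower adaptation.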

Figure 9 shows the cumulative throughput relative to
an ideal, genie-aided policy. An ideal policy is assumed to
be able to find all spectrum opportunities and select the
$L$ subbands with the highest instantaneous throughputs. The
throughput obtained using the exact BB search and the
throughput obtained using the heuristic IH method are practically
the same. However, in this case the IH method found the
assignment on average 80 times faster than the BB search.
It can be noticed that with $\epsilon = 0.1$ the proposed policy
eventually obtains 83% of the throughput of an ideal policy.

Figure 9: Cumulative throughput relative to an ideal policy. The
dashed curves are with the IH method and the solid curves with the
exact BB algorithm. For $\epsilon = 1$ the curves are exactly the same, as
this corresponds to the exploration-only case. For $\epsilon < 1$ the
performances are almost the same. For example, with $\epsilon = 0.1$ the proposed
policy is able to provide about 83% of the throughput of an ideal
policy. It can be seen that with small $\epsilon$ the converged throughput is
high, whereas the convergence rate is slow.

Figure 10: Convergence of the missed detection probability for
different choices of $\epsilon$, averaged over the subbands. The dashed curves
are with the IH method and the solid curves with the exact BB
algorithm. For $\epsilon = 1$ the curves are exactly the same, as this corresponds
to the exploration-only case. For $\epsilon < 1$ the performances are
almost the same. It can be seen that over time for $\epsilon < 1$ the missed
detection probability converges below $P_{\text{miss,target}} = 0.1$. Again, for
small $\epsilon$ the lower converged value comes at the price of a slow
convergence rate.

As can be seen, the price of a small $\epsilon$ is
a slower rate of convergence.

Figure 10 shows the probability of miss detection for
different choices of $\epsilon$. The diversity order for the fixed
policy (the curve corresponding to $\epsilon = 1$) was selected such
that on average the target miss detection probability
is achieved. The resulting average miss detection probabilities
using the exact BB search and using the heuristic
IH method are almost the same. When $\epsilon$ is decreased, the
policy starts assigning those SUs with high probability of
detection to sense the corresponding subbands more often,
thus decreasing the overall number of miss detections. For
$\epsilon = 0.1$ and $\epsilon = 0.3$ the average miss detection probability
is close to 0.04 at the end of the simulation.
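
The idea of assigning only the best SUs can be sketched for a single subband as follows. This is a deliberately simplified illustration and not the BB or IH assignment used in the simulations: it assumes independent SU decisions under the OR-rule, ignores the coupling between subbands, and the function name is hypothetical.

```python
def select_sensors(p_detect, p_miss_target):
    """Pick the fewest SUs whose OR-rule miss probability meets the target.

    p_detect: estimated local detection probabilities, one per SU.
    Returns the chosen SU indices (best first), or None if even all SUs
    together cannot meet the target.
    """
    # Consider the SUs in order of decreasing detection probability.
    order = sorted(range(len(p_detect)), key=lambda s: p_detect[s], reverse=True)
    chosen, miss = [], 1.0
    for s in order:
        chosen.append(s)
        miss *= 1.0 - p_detect[s]  # the OR-rule misses only if every chosen SU misses
        if miss <= p_miss_target:
            return chosen
    return None


print(select_sensors([0.5, 0.95, 0.6], 0.1))
print(select_sensors([0.6, 0.7, 0.5], 0.1))
```

When one SU detects reliably enough on its own, a single sensor suffices, which is the mechanism behind the reduced number of sensings.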

Figure 11: Number of sensings relative to a sensing policy with fixed
diversity order $D = 2$. The dashed curves are with the IH method
and the solid curves with the exact BB algorithm. For $\epsilon = 1$ the
curves are exactly the same, as this corresponds to the exploration-only
case. For $\epsilon < 1$ the performances are almost the same. It can
be seen that the proposed policy reduces the number of sensings in
the secondary network by assigning only the best SUs for sensing.
With $\epsilon = 0.1$ the number of sensing SUs per subband is almost
halved, meaning that on average in the exploitation phase there is
only one SU sensing per subband.



Figure 11 shows the number of sensings over time compared
to a sensing policy with fixed diversity order $D = 2$
(exploration only). The savings in the number of sensings
using the exact BB search and using the heuristic IH
method are again practically the same. For the case $\epsilon = 0.1$
the number of sensings and transmissions of the local sensing
results to the FC are reduced to 56% of those of the fixed policy.
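
As a rough consistency check (an illustrative back-of-the-envelope calculation, not a simulation result), suppose the policy explores with $D = 2$ SUs per subband in a fraction $\epsilon$ of the runs and exploits with an average of $\bar{D}$ SUs per subband otherwise, with the same number of sensed subbands in both phases. The symbol $\bar{D}$ is introduced here only for this example.

```latex
\frac{\epsilon D + (1-\epsilon)\bar{D}}{D} = 0.56
\quad\Longrightarrow\quad
\bar{D} = \frac{(0.56 - \epsilon)\,D}{1-\epsilon}
        = \frac{(0.56 - 0.1)\cdot 2}{0.9} \approx 1.02 .
```

Under these assumptions the 56% figure corresponds to roughly one sensing SU per subband in the exploitation phase.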

5.2. Expected throughput for non-stationary cases 

For a non-stationary scenario the throughput of the proposed
sensing policy is compared against two other
state-of-the-art policies. The results are shown only for the first stage of
the proposed sensing policy, which attempts to maximize
the throughput of the secondary network. The results are
shown for a case in which the availability of the subbands
is a Markov process and for a case in which the availability
is a Bernoulli process (i.e., a special case of a two-state
Markov chain). The comparison is done against the
discounted UCB (DUCB) policy with a discount factor $\gamma$ [17]
and a near-optimal sensing policy [13], the Whittle index
policy, which assumes the state transition probabilities in
the Markov chain to be known. The comparison between the
Whittle index policy and the two machine learning-based
policies is therefore not entirely fair since the assumptions
about prior knowledge are different.

Figure 12: The sum throughput of the secondary network over time
in a non-stationary scenario with Markovian rewards. The throughput
statistics are permuted randomly among the subbands at randomly
selected time instants, which show up as sudden deep drops
in the expected throughput. It can be seen that the DUCB provides
high throughput in the beginning when all the discounted mean
throughputs in the algorithm are zero. However, the proposed sensing
policy with $\epsilon$-greedy exploration seems to provide more stable
convergence at all times. Naturally the Whittle index based policy
has the best convergence at all times, since it assumes that the
throughput distributions are known.

Here the number of subbands has been set to $N_B = 5$
and the number of simultaneously sensed bands to $L = 1$.
The missed detection probability at the FC is assumed to
be $P_{\text{miss},FC} = 0.1$ and the false alarm rate $P_{f,FC} = 0.01$
using Neyman-Pearson detectors. The mean throughputs
of the bands are [11, 21, 31, 41, 51]. In the
first scenario the transition probabilities of the Markov
chain are initialized as $P_{00} = [0.5, 0.9, 0.6, 0.8, 0.8]$ and
$P_{11} = [0.9, 0.31, 0.7, 0.9, 0.3]$. In the Bernoulli case the
probabilities of the subbands being free are initialized
as $P = [0.87, 0.17, 0.43, 0.33, 0.78]$. To simulate
non-stationary behavior the transition probabilities and the
mean rewards are randomly permuted among the subbands
at random time instances.
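
The two availability models can be sketched as below. The sketch reads the two probability vectors above as the self-transition probabilities of state 0 and state 1 of each band, which is an assumption about the notation. The function names and the seed are illustrative.

```python
import random


def stationary_prob_state1(p00, p11):
    """Long-run fraction of time a two-state Markov chain spends in state 1."""
    return (1.0 - p00) / ((1.0 - p00) + (1.0 - p11))


def simulate_chain(p00, p11, n_steps, seed=0, start=0):
    """Simulate a two-state Markov chain with self-transition probabilities."""
    rng = random.Random(seed)
    state, path = start, []
    for _ in range(n_steps):
        stay = p00 if state == 0 else p11
        if rng.random() >= stay:   # leave the current state
            state = 1 - state
        path.append(state)
    return path


# Self-transition probabilities of the five bands (assumed reading of the text).
p00 = [0.5, 0.9, 0.6, 0.8, 0.8]
p11 = [0.9, 0.31, 0.7, 0.9, 0.3]
stationary = [stationary_prob_state1(a, b) for a, b in zip(p00, p11)]
print(stationary)

# Bernoulli special case: the next state does not depend on the current one,
# which requires p00 = 1 - p11. Here state 1 occurs with probability 0.87.
bernoulli_path = simulate_chain(1.0 - 0.87, 0.87, 1000)
```

In the Bernoulli special case the stationary probability of state 1 equals its self-transition probability, so the chain reduces to independent draws.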

Figure 12 shows the expected mean throughput for the
non-stationary Markovian case. Since it is assumed that
the Whittle index policy knows the throughput distributions
perfectly at each time, it naturally gives the highest
throughput. It can be seen that
DUCB adapts fast at the beginning when the discounted
mean throughputs in the algorithm have been set to zero.
However, after the first change in the throughput distributions
the convergence of DUCB slows down significantly.
The proposed sensing policy with $\epsilon$-greedy exploration
seems to provide more consistent convergence at all
times.



Figure 13 shows the expected throughput for the
non-stationary Bernoulli case. Here results are shown only
for the two machine learning-based policies, since neither
of them assumes any prior knowledge about the
underlying Bernoulli process. The two machine learning-based
sensing policies perform much as in the first
non-stationary scenario, with the proposed policy giving
slightly better overall performance than the DUCB policy.



Figure 13: The sum throughput of the secondary network over time
in a non-stationary scenario with Bernoulli rewards. The throughput
statistics are rotated randomly among the subbands at randomly
selected instants, which show up as sudden deep drops in the expected
throughput. The DUCB again performs well in the beginning but
degrades notably after the first change in the throughput distributions.

6. Conclusions

In this paper a machine learning based multi-band spectrum
sensing policy is proposed. In the proposed policy the
$\epsilon$-greedy method is employed to track the occupancy statistics
of the PU and to estimate the detection performance
of the SUs. Using the $\epsilon$-greedy method the proposed policy
exploits the gained knowledge about the throughputs
of different subbands by selecting the sensed subbands as
the ones with the highest Q-value. Furthermore, knowledge
about the detection performances of different SUs is
exploited by minimizing the number of SUs assigned for
sensing that are collaboratively able to meet a desired miss
detection probability threshold.

Exploration of the radio spectrum and different sens- 
ing assignments is realized using pseudorandom frequency 
hopping codes with fixed diversity order. Firstly, the pseu- 
dorandom exploration with fixed diversity order guaran- 
tees reliable sensing, and secondly, eventually all possible 
SU combinations of size D will be considered. 

In the exploitation phase the sensing assignment problem
is formulated as a binary integer programming problem
in which the objective is to minimize the number of
sensors $D$ per subband while ensuring the desired detection
performance at each subband. By minimizing the number
of sensing SUs per subband, the energy of the battery-operated
users is conserved and the amount of transmitted local test
statistics is reduced. The optimal sensing assignment may
be found by using exact branch-and-bound search, and a
near-optimal one by using an approximate algorithm such as
the iterative Hungarian method.

In this paper we demonstrate the performance of the
proposed sensing policy and derive analytical expressions
for the convergence of the policy. The simulation results
show that the proposed sensing policy provides excellent
performance in terms of throughput, detection probability
and energy efficiency.